Intro

NurseBridge (NB) wants to create a data driven predictive model allowing rural healthcare systems to make more informed decisions regarding compensation offerings at the most granular unit of time possible.

Data

Goal is to use this data to categorize risk of death by day of the week on a county level.

Underlying cause of death 2018-21

  • Data was pulled from the CDC website by grouping by County, Month, and Weekday.
    • Additional filtering/grouping was avoided (when additional features were added, such as place or cause of death, results were missing too many values).
  • Weekdays with value “Unknown” were removed.
  • Suppressed death counts were added via the CDC to protect privacy, and mask any death values number 1-9.
    • These values were replaced as NA values
  • Unreliable was appended to calculated crude-death-rates if total deaths were below 20.

Structure

Two datasets were joined, one listing month and weekday and their associated deaths by county; the other listing total population and total deaths by county. Crude death rate was also included and is essentially annual deaths per population.

TX_all %>% glimpse()
## Rows: 85,344
## Columns: 9
## $ state                    <fct> TX, TX, TX, TX, TX, TX, TX, TX, TX, TX, TX, T…
## $ county                   <fct> Anderson, Anderson, Anderson, Anderson, Ander…
## $ county_code              <fct> 48001, 48001, 48001, 48001, 48001, 48001, 480…
## $ year                     <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 201…
## $ month                    <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Feb, Feb, …
## $ weekday                  <fct> Sunday, Monday, Tuesday, Wednesday, Thursday,…
## $ daily_county_deaths      <chr> "Suppressed", "14", "Suppressed", "Suppressed…
## $ annual_county_deaths     <chr> "692", "692", "692", "692", "692", "692", "69…
## $ annual_county_population <dbl> 58057, 58057, 58057, 58057, 58057, 58057, 580…

Note

Due to suppressed data, any annual death rates calculated from the daily measure do not equal death rates from the summarised data. This is evidenced by taking the sum of all daily_deaths per county and comparing to reported annual_deaths in the dataset. Note no calculated deaths are greater than reported deaths, as suppressed data is not summed.

A potentially interesting measure emerges, that there are some counties which never experience a day with less than 9 deaths (annual_deaths == calc_annual_deaths)

## compare reported annual deaths to calculated annual deaths
compare <- 
  left_join(TX_pop_2021,
            TX_all %>% 
              filter(year == 2021) %>% 
              select(county_code, daily_county_deaths) %>% 
              group_by(county_code) %>% 
              mutate(daily_county_deaths = as.numeric(daily_county_deaths)) %>% 
              summarise(calc_annual_deaths = sum(daily_county_deaths, na.rm = TRUE))
            ) %>% 
  select(county_code, annual_deaths, calc_annual_deaths) %>% 
  mutate(lessOrEqual = ifelse(calc_annual_deaths <= annual_deaths, TRUE, FALSE))
  
comparison <- compare %>% 
  arrange(-annual_deaths)

DT::datatable(comparison)

EDA

Total deaths and population

How have total deaths in the state trended over the years compared to population?

TX_all %>%
  select(county, year, annual_county_population) %>% 
  distinct() %>% 
  group_by(year) %>% 
  summarise(TX_population = sum(annual_county_population)) %>% 
  mutate(year = as.factor(year)) %>% 
  left_join(
    TX_all %>%
      select(county, year, annual_county_deaths) %>% 
      distinct() %>% 
      group_by(year) %>% 
      summarise(total_deaths = sum(as.numeric(annual_county_deaths), na.rm = TRUE)) %>% 
      mutate(year = as.factor(year))
  ) %>% 
  pivot_longer(
    cols = TX_population:total_deaths) %>% 
  ggplot(aes(x=year, y=value, group=name, fill=name)) +
  # geom_col(position="dodge")
  geom_point()+
  geom_line()+
  facet_wrap(~name, scales="free_y") + 
  scale_y_continuous(labels=comma)

Annual patterns

Do certain months show patterns year after year?

gg_1 <- 
  TX_all %>% 
    mutate_if(is.character, as.numeric) %>% 
    mutate(year = as.factor(year)) %>% 
    group_by(year, month) %>% 
    summarise(deaths = sum(daily_county_deaths, na.rm = TRUE)) %>% 
    # ungroup() %>% 
    ggplot(aes(x=month, y=deaths, color=year, group=year)) +
    geom_point() + 
    geom_line(alpha=.8) + 
    labs(
      title = "Deaths over time"
    ) + 
    scale_y_continuous(labels = comma) + 
  xlab("Month") +
  ylab("Deaths")

ggplotly(gg_1)

Annual adjusted

Every year the deaths increase, what about after normalizing against population?

# get TX population by year
TX_pop <- TX_all %>%
  select(county, year, annual_county_population) %>% 
  distinct() %>% 
  group_by(year) %>% 
  summarise(TX_population = sum(annual_county_population)) %>% 
  mutate(year = as.factor(year))

gg_2 <- TX_all %>% 
  mutate_if(is.character, as.numeric) %>% 
  mutate(year = as.factor(year)) %>% 
  group_by(year, month) %>% 
  summarise(deaths = sum(daily_county_deaths, na.rm = TRUE)) %>% 
  left_join(TX_pop) %>%
  mutate(pct_pop_deaths = deaths/TX_population) %>% 
  ggplot(aes(x=month, y=pct_pop_deaths, color=year, group=year)) +
  geom_point() + 
  geom_line(alpha=.8) + 
  labs(
    title = "Deaths over time",
    subtitle = "Normalized for population"
  ) + 
  scale_y_continuous(labels = percent) + 
  xlab("Month") +
  ylab("Deaths (percent of state population")
ggplotly(gg_2)

Covid spike

Standard

Are the spikes seasonal or due to covid?

## visualize covid spike ----
gg_3 <- TX_all %>% 
  mutate_if(is.character, as.numeric) %>% 
  mutate(year = as.factor(year)) %>%
  group_by(year, month) %>% 
  summarise(deaths = sum(daily_county_deaths, na.rm = TRUE)) %>% 
  left_join(TX_pop) %>%
  mutate(
    pct_pop_deaths = deaths/TX_population,
    date = ymd(paste0(year,"-", month, "-01"))
    ) %>% 
  ggplot(aes(x=date, y=pct_pop_deaths, color=year)) +
  geom_point()+
  geom_line(alpha=.8)+
  labs(
    title = "Deaths over time",
    subtitle = "Normalized for population"
  ) + 
  scale_y_continuous(labels = percent) + 
  scale_x_date(date_breaks = "4 months", date_labels = "%Y-%m") + 
  xlab("Month") +
  ylab("Deaths (percent of state population")+
  geom_rect(aes(xmin=ymd("2020-01-01"), xmax=ymd("2021-12-01"), ymin=0, ymax=.001), alpha=.005, fill='orange', color=NA) +
  theme(legend.position = "none")  
gg_3

Plotly

ggplotly(gg_3)

Monthly patterns

If we look at how deaths have trended over time, on a per month basis, can we see any useful patterns at all?

TX_all %>% 
  mutate_if(is.character, as.numeric) %>% 
  mutate(year = as.factor(year)) %>%
  group_by(year, month) %>% 
  summarise(deaths = sum(daily_county_deaths, na.rm = TRUE)) %>% 
  left_join(TX_pop) %>%
  mutate(
    pct_pop_deaths = deaths/TX_population,
    date = ymd(paste0(year,"-", month, "-01"))
  ) %>% 
  ggplot(aes(x=year, y=deaths))+
  geom_point()+
  geom_line(group=1,alpha=.4)+
  facet_wrap(~month)

Weekday

Are certain weekdays more deadly?

TX_all %>% 
  mutate_if(is.character, as.numeric) %>% 
  select(county, year, month, weekday, daily_county_deaths) %>% 
  group_by(weekday) %>% 
  summarise(total_deaths = sum(daily_county_deaths, na.rm = TRUE)) %>% 
  ggplot(aes(x=weekday, y=total_deaths))+
  geom_point()+
  geom_line(group=1)+
  scale_y_continuous(labels=comma)

And does this pattern hold true year by year?

TX_all %>% 
  mutate_if(is.character, as.numeric) %>% 
  select(county, year, month, weekday, daily_county_deaths) %>% 
  group_by(year, weekday) %>% 
  summarise(total_deaths = sum(daily_county_deaths, na.rm = TRUE)) %>% 
  ggplot(aes(x=weekday, y=total_deaths))+
  geom_point()+
  geom_line(group=1)+
  scale_y_continuous(labels=comma)+
  facet_wrap(~year, scale="free_y", ncol=1)